Automatic extraction of property norm-like data from large text corpora
Author
Abstract
Traditional methods for deriving property-based representations of concepts from text have focused on either extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is-a vehicle) or meronymy/metonymy (e.g., car has wheels), or unspecified relations (e.g., car--petrol). We propose a system for the challenging task of automatic, large-scale acquisition of unconstrained, human-like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept-relation-feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property-based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human-generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human-judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state-of-the-art, while subsequent evaluations exhibit the human-like character of our generated properties.
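The reweighting step mentioned in the abstract (a linear combination of triple frequency and four statistical metrics) can be illustrated with a minimal sketch. The abstract does not name the four metrics or give the weights, so the names `m1` to `m4`, the weight values, and all scores below are hypothetical placeholders chosen only for illustration; this is not the authors' implementation.

```python
from typing import Dict, List, Tuple

# A candidate is a (concept, relation, feature) triple, as in the abstract.
Triple = Tuple[str, str, str]


def rerank(
    candidates: Dict[Triple, List[float]],
    weights: List[float],
) -> List[Tuple[Triple, float]]:
    """Score each triple as a weighted sum of its values and sort descending."""
    scored = []
    for triple, values in candidates.items():
        if len(values) != len(weights):
            raise ValueError(f"expected {len(weights)} values for {triple}")
        score = sum(w * v for w, v in zip(weights, values))
        scored.append((triple, score))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


# Hypothetical values per triple: [frequency, m1, m2, m3, m4].
# The metric names and all numbers are assumptions, not data from the paper.
candidates = {
    ("car", "be", "thing"): [1.0, 0.0, 0.0, 0.0, 0.0],
    ("car", "require", "petrol"): [0.75, 0.5, 0.5, 0.25, 0.5],
    ("car", "be", "fast"): [0.5, 0.25, 0.5, 0.25, 0.25],
}

# Hypothetical weights; in practice these would be tuned against held-out data.
weights = [0.5, 1.0, 1.0, 1.0, 1.0]

ranked = rerank(candidates, weights)
for triple, score in ranked:
    print(triple, score)
```

In this toy setup, a frequent but uninformative triple such as car be thing falls below less frequent triples that score well on the other metrics, which is the general motivation for combining frequency with additional statistics.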
Similar resources
Automatic extraction of property norm-like features from large text corpora with gold standard, human and semantic-similarity evaluations
Property norms (e.g., banana is yellow, aeroplane has wings) play a key role in cognitive science, forming the basis for many recent theoretical accounts of conceptual representations (e.g., Cree et al., 2006; Grondin et al., 2009; Randall et al., 2004). Such norms are typically derived from norming studies where a large number of human participants elicit properties for a set of concepts (e.g....
Large-Scale Acquisition of Feature-Based Conceptual Representations from Textual Corpora
Methods for estimating people’s conceptual knowledge have the potential to be very useful to theoretical research on conceptual semantics. Traditionally, feature-based conceptual representations have been estimated using property norm data; however, computational techniques have the potential to build such representations automatically. The automatic acquisition of feature-based conceptual repr...
Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps
Property norms have the potential to aid a wide range of semantic tasks, provided that they can be obtained for large numbers of concepts. Recent work has focused on text as the main source of information for automatic property extraction. In this paper we examine property norm prediction from visual, rather than textual, data, using cross-modal maps learnt between property norm and visual spac...
Using Decision Trees and Text Mining Techniques for Extending Taxonomies
Lexical taxonomies have tree-like structures and can thus be extended to become decision trees that serve for their own extension. In this paper, a semi-automatic procedure for extending lexical taxonomies is proposed that makes use of term extraction methods for identifying new concepts and that uses cooccurrence data from large corpora to generate the necessary features (semantic descriptions...
Extracting Parallel Corpora from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation methods are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are used instead for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Journal title:
- Cognitive Science
Volume 38, Issue 4
Pages: -
Publication date: 2013